The sequencing of entire genomes is a 
major achievement, but the meaning of 
the mass of accumulated data is only 
just beginning to be unraveled. At first sight, 
the task appears straightforward: locate the 
genes and translate the coding regions to es- 
tablish their protein products; perform simi- 
larity searches to establish relationships with 
previously characterized sequences and as- 
sign function by evolutionary inference; and 
rationalize the function in structural terms us- 
ing known or model-derived structures. Giv- 
en the quantity of data, the procedures should 
be automated as much as possible. 
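
The first step of this seemingly straightforward recipe, finding an open reading frame and translating it, can be sketched in a few lines of Python. The sketch assumes the standard genetic code and scans only the forward strand for reading frames that begin with ATG; the names translate and longest_orf are illustrative and do not come from any tool discussed here. It ignores introns, the reverse strand, and alternative start codons, so it is a toy rather than a gene finder.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding sequence, stopping at the first stop codon."""
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an unrecognized codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def longest_orf(dna: str) -> str:
    """Return the longest translation that starts at any forward-strand ATG.

    Each candidate is read until a stop codon or the end of the sequence.
    """
    dna = dna.upper()
    best = ""
    for start in range(len(dna) - 2):
        if dna[start:start + 3] == "ATG":
            protein = translate(dna[start:])
            if len(protein) > len(best):
                best = protein
    return best


print(translate("ATGGCCAAATGA"))
print(longest_orf("CCATGGCCAAATGACC"))
```

Even on this tiny input the scan finds two candidate start codons, a small reminder that choosing among candidates is where the real difficulty lies.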

The reality, of course, is not so simple. At- 
tempts to decipher the clues latent in genom- 
ic data are hampered because current meth- 
ods to predict genes in uncharacterized DNA 
are unreliable (and it is not always clear what 
we mean by "gene"); it is presumptuous to 
make functional assignments merely on the 
basis of some degree of similarity between 
sequences (and it is not always clear what we 
mean by "function"); very few structures are 
known compared with the number of se- 
quences, and structure prediction methods 
are unreliable (and knowing structure does 
not inherently tell us function); the degree of 
automation that has been used of necessity, 
with its imperfect tools and protocols, has led 
to the accumulation of much database misin- 
formation; and the terminology has been im- 
precise, muddying perceptions of what can 
realistically be achieved. Given these problems, 
what is the state of the art in 
sequence-structure-function bioinformatics? 



Gene prediction 

Information used to predict genes includes 
signals in the sequence, content statistics, 
and similarity to known genes. In a recent 
test of gene detection tools on part of the 
Drosophila genome, the majority of these 
"gene finders" identified 95% of coding 
nucleotides, but intron/exon structures 
were correctly predicted for only about 
40% of genes. The different methods failed 
to find between 5% and 95% of genes, and 
incorrectly identified up to 55% (1). But 
probably the most sobering evidence of the 
frailty of gene prediction methods is the 
uncertainty in the number of genes in the 
human genome, with current estimates 
ranging from 27,462 to 312,278. The methods 
used to arrive at these numbers each 
involve different approximations and ex- 
trapolations. Nevertheless, it is disturbing 
that the different analytical approaches 
should yield such disparate results. 
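
The gap between finding most coding nucleotides and getting few gene structures exactly right is easier to see with a small calculation. The following Python sketch is an illustration under stated assumptions: exons are given as (start, end) intervals with the end excluded, both lists are non-empty, "specificity" follows the gene-finding convention of the fraction of predicted bases that are truly coding, and the function names are hypothetical.

```python
def covered(exons):
    """Expand (start, end) exon intervals, end excluded, into a set of positions."""
    return {pos for start, end in exons for pos in range(start, end)}


def nucleotide_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) at the coding-nucleotide level."""
    true_pos = covered(annotated)
    pred_pos = covered(predicted)
    overlap = len(true_pos & pred_pos)
    sensitivity = overlap / len(true_pos)  # fraction of real coding bases found
    specificity = overlap / len(pred_pos)  # fraction of predicted bases that are real
    return sensitivity, specificity


def exact_structure(annotated, predicted):
    """True only if every exon boundary of the gene is reproduced."""
    return sorted(annotated) == sorted(predicted)


annotated = [(100, 200), (300, 400)]
predicted = [(100, 200), (305, 400)]  # one splice site misplaced by five bases

print(nucleotide_accuracy(annotated, predicted))
print(exact_structure(annotated, predicted))
```

In this made-up case a single misplaced splice site leaves nucleotide-level sensitivity at 97.5% while the gene structure as a whole counts as wrong.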

What is a gene? 

Perhaps the biggest obstacle to accurate 
gene counting is that even the definition of 
a gene is unclear. Is it a heritable unit cor- 
responding to an observable phenotype? 
Or is it a packet of genetic information 
that encodes a protein, or proteins? Or per- 
haps one that encodes RNA? Must it be 
translated? Are genes genes if they are not 
expressed? As definitions vary, inevitably 
so do estimates of the total number of 
genes in sequenced genomes. 




The sequence-structure imbalance 

To date, more than 540,000 protein sequences 
have been deposited in the nonredundant 
database maintained by the National Center 
for Biotechnology Information (NCBI), and 
millions of expressed sequence tags (ESTs), 
which are partial sequences of clones that are 
often error prone, are housed in public and 
proprietary repositories. These numbers will 
snowball with the fruition of further genome 
projects. By contrast, the number of unique 
protein structures is still less than 2000. Of 
course, we do not know how many unique se- 
quences there are; nevertheless, it is clear that 
there is a dearth of structural information. 

Given this sequence-structure imbalance, 
it is imperative that we focus on deciphering 
the structural, functional, and evolutionary 
clues encoded in the language of biological 
sequences. Two distinct analytical approaches 
have emerged. Pattern recognition methods 
aim to detect similarity between sequences 
and structures and infer related 
functions. Thus, they require some charac- 
teristic to have been observed and deposited 
in a reference database. In contrast, ab initio 
prediction methods deduce structure directly 
from sequence. The approaches are quite 
different and should not be confused. Their 
levels of success also differ markedly. 

[Figure. Panel labels: Repetitive; Very low complexity; Calcium binding; Nucleotide binding; Membrane anchoring.] 

www.sciencemag.org SCIENCE VOL 290 20 OCTOBER 2000 471 



Function prediction through pattern 
recognition 

Tools for similarity searching are standard 
components of the sequence annotator's armory. 
Sequence similarity programs may seek 
pairwise similarities in large sequence repositories 
or search for conserved patterns in gene 
family databases (2-5). Gene family databases 
allow more specific functional diagnoses to 
be made than is possible by pairwise searching. 
They are based on the principle that related 
sequences can be aligned to find regions 
(motifs) that show little variation. These motifs 
usually reflect some vital structural or 
functional role (see the figure), and they can 
be used to derive diagnostic family signatures. 
Sequences can then be searched against 
databases of such signatures to see whether 
they can be assigned to known families. Gene 
family databases have recently been integrated 
to create a unified protein family resource 
(6), facilitating the inference of function by 
identifying homologous relationships. 
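
Searching a sequence against a motif signature can be sketched as follows. This Python example assumes a simplified PROSITE-style pattern syntax (x for any residue, square brackets for allowed residues, curly brackets for forbidden residues, parentheses for repeat counts) and does not handle anchors or other extensions. The pattern shown is the commonly quoted P-loop signature for ATP/GTP binding, and the sequence is a short fragment resembling the start of a Ras-like GTPase, used purely as an illustration. The function names are hypothetical.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE-style pattern into a Python regular expression."""
    regex = []
    for element in pattern.split("-"):
        element = element.replace("x", ".")                      # x = any residue
        element = element.replace("{", "[^").replace("}", "]")   # {P} = any residue but P
        element = element.replace("(", "{").replace(")", "}")    # x(4) = four positions
        regex.append(element)
    return "".join(regex)


def find_motif(sequence: str, pattern: str):
    """Return (start, matched_text) for each non-overlapping match, 0-based."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


P_LOOP = "[AG]-x(4)-G-K-[ST]"          # nucleotide-binding signature
fragment = "MTEYKLVVVGAGGVGKSALTIQ"    # illustrative sequence fragment

print(prosite_to_regex(P_LOOP))
print(find_motif(fragment, P_LOOP))
```

A match of this kind says only that the sequence carries the signature; whether the protein actually shares the family's function still has to be argued from other evidence.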

The term "homology," a fundamental concept 
in bioinformatics, is often used incorrectly. 
Sequences are homologous if they are 
related by divergence from a common ancestor 
(7). Conversely, analogy relates to the acquisition 
of common structural or functional 
features via convergent evolution from unrelated 
ancestors. For example, β barrels occur 
in soluble serine proteases and integral membrane 
porins, but despite their common architecture, 
they share no sequence or functional 
similarity. Similarly, the enzymes chymotrypsin 
and subtilisin share groups of catalytic 
residues with almost identical spatial 
geometries, but they have no other sequence 
or structural similarities. Homology is not a 
measure of similarity, but rather an absolute 
statement that sequences have a divergent 
rather than a convergent relationship. This is 
not just a semantic issue because imprecise 
use of the term obscures evolutionary relationships. 
In comparing structures, the same 
arguments apply. Structures may be similar, 
but common evolutionary origin remains a 
hypothesis until supported by other evidence; 
the hypothesis may be correct or mistaken, 
but the similarity is a fact (8). 
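
The distinction can be made concrete: similarity is something we measure, whereas homology is something we infer. The following Python sketch computes percent identity from two already aligned sequences. It assumes gaps are written as "-" and that gapped columns are excluded from the count, which is only one of several conventions in use; the function name and the example sequences are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over the ungapped columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip columns that contain a gap
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


print(round(percent_identity("ACDE-FG", "ACDQKFG"), 1))
```

The number returned is a measurement. It does not say whether the two sequences diverged from a common ancestor, so it is correct to report a percentage of identity or similarity but not a percentage of homology.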

Among homologous sequences, we can 
distinguish orthologs (proteins that usually 
perform the same function in different 
species) and paralogs (proteins that perform 
different but related functions within one 
organism). Orthologs allow investigation of 
cross-species relationships, whereas paralogs, 
which arise via gene duplication events, shed 
light on underlying evolutionary mechanisms 
because the duplicated genes follow separate 
mator module, it is unlikely that the func- 
In using modules to confer different 
functionalities, Nature uses old material to 
create new systems. The complexity of such 
systems poses important problems for computational 
approaches because the properties 
of a system can be explained by but not 
deduced from those of its components (9, 
10). The presence of a module tells little of 
the function of the complete system; knowing 
most components of a mosaic does not 
allow us easily to predict a missing one, and 
modules in different proteins do not always 
perform the same function. 

Many other factors also complicate 
function assignment: gene functions may 
be redundant, nonorthologous displacement 
can replace genes with unrelated but 
functionally analogous genes, horizontal 
gene transfer can introduce genes from 
different phylogenetic lineages, and 
lineage-specific gene loss can eliminate ancestral 
genes. Thus, genomes harbor many 
obstacles to reliable function assignment. 
accurately and that our concepts of "function" 
differ, making function assignment 
tricky. It would seem, however, that we can 
agree on what structures are. They are tangible, 
measurable things, so should we not 
be able to predict them reliably? 

Structure prediction methods range 
from computationally intensive strategies 
that simulate the physical and chemical 
forces involved in protein folding to 
knowledge-based approaches that use in- 
formation from structure databases to 
build models. Yet the problem of predicting 
protein structure remains unsolved: 
knowledge-based techniques typically produce 
low-resolution models, and no current 
method yields reliable predictions for 
remote homologs (12). For small proteins, 
ab initio methods generate models with 
substantial segments that resemble the correct 
fold, but results deteriorate beyond 
~100 residues. Today, knowledge-based 
methods, especially those that combine information 
from different approaches, give 
the best results (13). The most successful 
modeling and fold recognition studies have 
balanced better algorithms with appropriate 
levels of manual intervention (14). 

Prediction methods do not work well 
because we do not fully understand how 
the primary structure of a protein determines 
its tertiary structure. Structural genomics 
projects will gradually lessen our 
reliance on prediction, because they aim to 
provide experimental structures or models 
for every protein in all completed genomes 
(although membrane protein structures 
will be difficult to obtain because they are 
difficult to crystallize). We must keep in 
mind, however, that structure alone will 
not inherently tell us function (see the figure). 
For example, determining the structure 
of a hypothetical protein and discovering 
that it binds ATP (15) may shed light 
on possible aspects of its function, but 
such information does not reveal its specific 
biological function. 



What is structure? 

In the context of fold recognition and prediction, 
it is important to be precise about 
what we mean by "structure." For example, 
is a prediction a "good" prediction if 
it correctly reproduces all atomic positions, 
the topology (connectivity of secondary 
structures), the architecture (gross 
arrangement of secondary structures), or 
merely the structural class (mainly α, 
β, etc.)? Where does a "reasonably 
good" prediction fall in this hierarchy, and 
what level of structural detail does a 
"rough near miss" (16) reveal? Using such 
imprecise words hinders comprehension 
and makes it difficult to evaluate what a good 
prediction really is. 
In "predicting" genes, protein functions, 
and structures, it is helpful to define our 
terms precisely and be honest about our 
achievements. Otherwise, we will continue 
to be baffled by paradoxical new prediction 
methods that yield >80% error rates. Gene 
identification, structure prediction, and 
functional inference are nontrivial compu- 
tational tasks, but with the relentless accu- 
mulation of sequence data, improvements 
continue to be made in all areas. 

Nature functions by integration, and the 
adoption of a more holistic view of complex 
biological systems is an essential next step 
for bioinformatics. To get the most from ge- 
nomic data, we need to take account of in- 
formation on the regulation of gene expres- 
sion, metabolic pathways, and signaling cas- 
cades. Proteins do not work in isolation but 
are involved in interrelated networks. Un- 
raveling these networks and their interac- 
tions will be vital to our understanding of 
normal and pathologic cell development, 
and will help us create an integrated map- 
ping between genotype and phenotype. 

Genomics-based drug discovery is 
heavily dependent on accurate functional 
annotation. Toward this end, bioinformatics 
will need to deliver highly integrated, inter- 
operable databases (and data "warehous- 
es") that allow the user to reason over dis- 
parate data sources and ultimately enable 
knowledge-based inference and innovation. 
The more genome annotation is automated, 
the greater will be the need for collabora- 
tion between software developers, annota- 
tors, and experimentalists. And the more 
data we have to handle, the more rigorous 
we must be in our thinking (and writing) if 
we are to make sense of the complexities. 
Sequence-structure-function bioinformatics 
does not yet yield all the answers, but a 
future holistic approach should help fuse 
today's glimmerings of knowledge into a 
new dawn of understanding. 
The average personal computer 
spends much less than half a day 
actually performing useful compu- 
tations. Many users, concerned about the 
vulnerability of expensive electronic 
components to the constant 
cycling of the power on and 
off, leave their systems on 
continuously. It is staggering 
to imagine the enormous, un- 
used computing resources of 
several million PCs left running 
unattended. One popular 
approach to tapping this computing 
power is the Search for Extra-Terrestrial 
Intelligence (SETI) project (1), 
which breaks giant computing problems 
into pieces that can be solved on personal 
computers in their spare time. 
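
The idea of breaking a large computation into pieces that idle machines can process independently can be illustrated with a toy example. The following Python sketch is an illustration only: it is not the protocol used by SETI or by any company, the function names are hypothetical, and a sum of squares stands in for a real scientific job. Here the work units are processed one after another on a single machine, but each call to process_unit depends only on its own unit and so could run on a different computer.

```python
def split_into_work_units(start, stop, unit_size):
    """Cut the range [start, stop) into consecutive (lo, hi) work units."""
    return [(lo, min(lo + unit_size, stop)) for lo in range(start, stop, unit_size)]


def process_unit(unit):
    """Work done by one volunteer machine: here, a partial sum of squares."""
    lo, hi = unit
    return sum(n * n for n in range(lo, hi))


def run_job(start, stop, unit_size):
    """Hand out the units, collect the partial results, and combine them."""
    units = split_into_work_units(start, stop, unit_size)
    partial_results = [process_unit(unit) for unit in units]
    return sum(partial_results)


print(split_into_work_units(0, 10, 3))
print(run_job(0, 10, 3))
```

The scheme works only when the pieces do not need to talk to one another, which is why problems of this shape suit loosely connected personal computers.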

Popular Power, Inc. is a company offering 
a new twist on this theme. Like 
SETI, a company computer feeds pieces 
of large computing problems to net- 
worked personal computers via their soft- 
ware program, Popular Power Worker, for 
idle-time operation. Popular Power's ap- 
proach differs, however, in providing a 
variety of computing problems to work 
on. These include nonprofit projects with 
no financial incentive to the personal 
computer owner, as well as commercial 
jobs that will eventually pay users for 
tasks performed on their machines. 

The current version of the Popular Pow- 
er Worker runs only on Windows and Lin- 
ux systems and is officially in pre-release 
form. The preliminary status of the soft- 
ware is readily apparent; numerous bugs, 
frequent crashes, and difficulties in installation 
plague the program currently. If information 
at the company Web site is accurate, 
personal computer owners interested 
in Popular Power's computing model may 
find dealing with the problems of the early 
release worth their while. Users of the pre- 
release software are promised priority of 
access to commercial computing jobs after 
the official version is released. Popular 
Power Worker can be downloaded for free 
from the company's Web site, and it installs 
as a screen saver, which starts the program 



SolBKystems are planned. 

The benefits of the Popular Power 
scheme for distributed computing tasks do 
not accrue solely to the user whose com- 
puter is used. The flexible nature of Popu- 
lar Power's design provides access for 
businesses, scientists, and anyone with 
massive computing projects to computing 
power that is potentially far greater than 
they would gain from a fixed piece of 
hardware. Personal computer users might 
be able to select which com- 
mercial job to run through 
Popular Power Worker de- 
pending on the return offered 
by the originating contractor. 
A key to the success of the 
computing model is likely to 
be the price Popular Power demands 
for acting as the interface 
between the computing project creators 
and the personal computer users. 

In summary, the current version of 
Popular Power Worker is still in the testing 
phase and users may find the software unstable. 
Tech-savvy personal computer enthusiasts 
are best suited to test the current 
pre-release product. The remaining users 
are advised to wait at least for the official 
release of the software. 

—Kevin Ahern 
running when it becomes active. Future 
SOFTWARE 

Eyes on the Skies 

The orbital space above Earth con- 
tains an astonishing collection of 
man-made satellites. Tracking all of 
these objects is no small task. Liftoff is a 
NASA Web site that provides several soft- 
ware tools to locate, track, and identify 
Earth-orbiting 
satellites. At the 
Web site, three pro- 
grams are available: 
J-Pass (identifies 
satellites passing 
overhead); J-Track 
(allows one to track 
orbiting objects); 
and J-Track 3D (al- 
lows one to view satellites orbiting Earth 
from a perspective far away in space). 
Each of these platform-independent appli- 
cations is written in Java and is accessible 
from both Internet Explorer and Netscape 



J-Track, J-Track 3D, 
and J-Pass 
NASA 

Free 

http://liftoff.msfc.nasa.gov/realtime/JTrack/Spacecraft.html 






